Introduction

AmpseqR is an R package for analysis of amplicon deep sequencing (AmpSeq) data generated on the Illumina platform. The pipeline offers various useful functions including Data pre-processing, Amplicon sequence variant (ASVs) estimation, Data post-processing, and Data visualization.

The html report include:




Citation

https://github.com/bahlolab/ampseqr



Data

The data used to generate this report is output by AmpSeqR, the run_data.rds data includes run_data.rds that contains marker_info, sample_manifest, demultiplexed, flt_reads, sub_reads, seq_tbl, seq_tbl_rm_indel, seq_tbl_end.



Quality check

Sample info

The dataset includes 40 samples.

Amplicon marker info

The dataset includes 5 amplicon markers.
Marker names: Chr03, Chr05, Chr06, Chr07, Chr08

Overview of read counts

The pipeline provides an example to pull out how many reads were retained at various points of the pipeline (Data pre-processing, Amplicon sequence variant (ASVs) estimation, and Data post-processing). As a final check, this can track the reads through the pipeline.


Overview of read counts


Filtering Steps Description
demultiplexed Read count after demultiplexing
flt_reads Read count after filtering the low quality score sequences
sub_reads Down sampling per sample per amplicon data to certain number reads
seq_tbl Read count after calling amplicon sequence haplotype, and remove the noise sequence
seq_tbl_rm_indel Read counts after integrating sequences vary only for indels
seq_tbl_end Read counts after removing failed samples and markers and analyzing sequences across the entire dataset


Read counts table

The table represents the summary of the read counts at various filtering steps of the pipeline.

read_counts_tab(run)


Read counts plot

The figure represents the proportion of sequences that are final retained from demultiplexed to seq_tbl_end.

read_counts_plot(run)


Followed we will look in more detail how many reads were dropped at various points of the pipeline.


Data pre-processing


Individual

The figure represents the remained reads per sample per marker after demultiplexing.

p <- run$demultiplexed %>% 
      filter(!(is.na(sample_id))) %>% 
      filter(!(is.na(marker_id))) %>% 
      ggplot(aes(sample, n, fill = marker_id)) +
      geom_col() +
      coord_flip() +
      ggtitle('Demultiplexing counts')+
      facet_wrap(~marker_id,nrow = 1)+
      theme_light()+
      theme(
        legend.position = "none",
        axis.text.x = element_text(angle = 90)
      )
ggplotly(p)


Entire dataset

The figure represents the summary of remained reads after demultiplexing by the amplicon marker.

p <- run$demultiplexed %>% 
      filter(!(is.na(sample_id))) %>% 
      filter(!(is.na(marker_id))) %>% 
      ggplot(aes(marker_id, n, fill = marker_id)) +
      geom_boxplot() +
      ggtitle('Demultiplexing counts')+
      labs(y="Number of reads")+
      theme_light()
ggplotly(p)


Demultiplexed sequence counts table

The table represents the remained reads per sample per marker after demultiplexing.

run$demultiplexed %>% 
      filter(!(is.na(sample_id))) %>% 
      filter(!(is.na(marker_id))) %>% 
      select(sample_id,marker_id,n,sample,info) %>% 
      DT::datatable(
        extensions = 'Buttons',
    options = list(dom = 'Blfrtip',
                           buttons = c('csv', 'excel'),
                           lengthMenu = list(c(10,25,50,-1),
                                             c(10,25,50,"All")))
      )


Filter and denoise reads


Individual

The figure represents the remained reads per sample per marker after filtering the low quality score sequences based on dada2::filterAndTrim function. Fail represents the failed reads. Pass represents the reads that passed the filter step.

p <-  run$flt_reads %>% 
      mutate(fail = n_in - n_out, pass = n_out) %>% 
      select(sample_id, marker_id, pass, fail,sample) %>% 
      pivot_longer(c(pass, fail), names_to = 'status', values_to = 'n') %>% 
      ggplot(aes(sample, n, fill = status)) +
      geom_col() +
      coord_flip() +
      facet_wrap(~ marker_id,nrow = 1,scales = "free_x") +
      ggtitle('Read quality filtering (DADA2)')+
      theme_light()+
      theme(
        axis.text.x = element_text(angle = 90)
      )
ggplotly(p)


Entire dataset

The figure represents the summary of remained reads after filtering the low quality score sequences by the amplicon marker. Fail represents the failed reads. Pass represents the reads that passed the filter step.

p <- run$flt_reads %>% 
      mutate(fail = n_in - n_out, pass = n_out) %>% 
      select(sample_id, marker_id, pass, fail,sample) %>% 
      pivot_longer(c(pass, fail), names_to = 'status', values_to = 'n') %>% 
      ggplot(aes(marker_id, n, fill = status)) +
      geom_boxplot(aes(group=status)) +
      ggtitle('Read quality filtering')+
      labs(y="Number of reads")+
      theme_light()
ggplotly(p) %>%
  layout(boxmode = "group")


Sequence counts table

The table represents reads went in (n_in) and how many reads made it out (n_out) after filtering the low quality score sequences.

run$flt_reads %>% 
      mutate(fail = n_in - n_out) %>% 
      select(sample_id, marker_id, n_in, n_out, fail,sample,info) %>% 
      DT::datatable(
        extensions = 'Buttons',
    options = list(dom = 'Blfrtip',
                           buttons = c('csv', 'excel'),
                           lengthMenu = list(c(10,25,50,-1),
                                             c(10,25,50,"All")))
      )


Amplicon sequence variant (ASVs) estimation & Data post-processing


Here, we check how many reads were dropped at Amplicon sequence variant (ASVs) estimation & Data post-processing steps.

Amplicon sequence variant (ASVs) estimation filters amplicon sequence haplotypes that are likely errors base on dada2 sample inference algorithm.

Data post-processing filters chimera sequences and maps amplicon sequence haplotypes to reference sequences to remove potentially erroneous amplicon sequence haplotypes and remove amplicon sequence haplotypes that do not meet thresholds (Parameters set before running AmpSeqR).

The amplicon sequence haplotype status:

Type Description
pass Read count that passed the filter step
low_sample_count Failed: The sample read count lower than the minimum number of read per sample per marker
low_asv_count Failed: The amplicon sequence haplotype count lower than the minimum number of read for each haplotype
low_asv_freq Failed: The amplicon sequence haplotype frequency lower than the minimum haplotype frequency
low_ident Failed: The amplicon sequence haplotype similarity (map to reference) lower than the minimum sequence similarity
low_ident_z Failed: The standardized amplicon sequence haplotype similarity (map to reference) lower than the minimum standardized sequence similarity
chimera Failed: The amplicon sequence haplotype is chimera reads
multiple Failed: multiple fail types


Individual

The figure represents the passed reads and failed reads (different types) per sample per marker after Amplicon sequence variant (ASVs) estimation & Data post-processing steps.

status_pal <-
    run$seq_tbl %>% 
    mutate(status = if_else(str_detect(status, ';'), 'multiple', status)) %>% 
    select(status) %>% 
    distinct() %>% 
    filter(status != 'pass') %>% 
    mutate(colour = scales::hue_pal()(n())) %>% 
    add_row(status = 'pass', colour = 'grey50') %>% 
    with(setNames(colour, status))

p <- run$seq_tbl %>% 
      mutate(status = if_else(str_detect(status, ';'), 'multiple', status)) %>% 
      ggplot(aes(sample, count, fill = status)) +
      geom_col() +
      scale_fill_manual(values = status_pal) +
      coord_flip() +
      facet_wrap(~marker_id,nrow = 1,scales = "free_x") +
      ggtitle('ASV status filtering')+
      theme_light()+
      theme(
        axis.text.x = element_text(angle = 90)
      )
ggplotly(p)
Entire dataset

The figure represents the summary of passed reads and failed reads (different types) after Amplicon sequence variant (ASVs) estimation & Data post-processing steps.

p <- run$seq_tbl %>% 
      mutate(status = if_else(str_detect(status, ';'), 'multiple', status)) %>% 
      ggplot(aes(status, count, fill = status)) +
      geom_boxplot(aes(group=status)) +
      scale_fill_manual(values = status_pal) +
      ggtitle('ASV status filtering')+
      facet_wrap(~marker_id,scale='free_x',shrink = FALSE)+
      coord_flip()+
      theme_light()+
      theme(
        axis.text.y = element_text(size=15)
      )
p


Sequence counts table

The table represents the summary of the amplicon sequence haplotype status after Amplicon sequence variant (ASVs) estimation & Data post-processing steps.

run$seq_tbl  %>% 
      select(sample_id, marker_id, count,status, ident, ident_z, sample,info) %>% 
      DT::datatable(
        extensions = 'Buttons',
    options = list(dom = 'Blfrtip',
                           buttons = c('csv', 'excel'),
                           lengthMenu = list(c(10,25,50,-1),
                                             c(10,25,50,"All")))
      )


Missingness (Dataset)


Missingness

The figure represents missing values present in the dataset.


Marker missingness

The table represents marker missingness in the dataset.


Passed samples & markers


Passed

  • A total of 37/(of 40) Samples remained for downstream analysis after filtering steps.

  • A total of 5/(of 5) Markers remained for downstream analysis after filtering steps.

Failed sample

Failed marker



Haplotype analysis

Read counts

The figure represents an overview of the read counts for amplicon markers of all samples in the dataset.

hap_read_counts_plot(run)


Within-sample diversity


The diversity of each sample by different amplicons. This is measured by the count of SNP-based haplotypes for all amplicons.


The number of unique haplotypes per amplicon per sample

The figure represents the unique haplotypes detected for each amplicon each sample.

unique_haplotypes_heatmap(run$seq_tbl_end)

The average number of unique haplotypes

The figure represents the average number of unique haplotypes detected for each amplicon in the dataset.

p <- run$seq_tbl_end %>%
    group_by(sample,marker_id) %>%
    summarise(n=n()) %>%
    ungroup() %>%
  ggplot(aes(x=marker_id, y=n, fill=marker_id)) +
  geom_boxplot()+
    geom_point(color="black", size=0.4, alpha=0.5,position=position_jitter(w=0.3,h=0.1),shape = 21,aes(label=sample)) +
    theme_bw() +
    xlab("Amplicon Marker")+
    ylab("Number of distinct haplotypes")+
  theme(
    axis.text.x = element_text(size=16),
    axis.text.y = element_text(size=16),
    axis.title.x = element_text(size=17),
    axis.title.y = element_text(size=17),
    strip.text = element_text(size=17),
    legend.text = element_text(size=15),
    legend.title = element_text(size=16),
    legend.position = "none"
  )
ggplotly(p,tooltip = c("sample","n"))


Haplotype diversity metrics of amplicon markers in dataset


Metrics Description
Haplotype the unique haplotypes detected in dataset
He expected heterozygosity
Size(bp) the amplicon length
SNPs the number of variable sites detected in dataset compared to the reference sequence
Ave_hap the average number of unique haplotypes detected for each amplicon in dataset
hapdiv_metrics(run$seq_tbl_end,run$marker_info)


Haplotype sequence


Haplotype sequence

The sequence represents the unique haplotype sequence detected for each amplicon in dataset. Labeled by haplotype name (haplotype frequency%).

haplotype_sequence_vis(run$seq_tbl_end,run$marker_info)
                                        20                  40                  60                  80                 100                 120        
                      '''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''        
    Chr03-1(38.5%)    ········A····-······T·A·········TTATAATTCCGCGC················GC··································TG···························    126    
    Chr03-2(34.6%)    ········A····-······T·A·········TTATAATTCCGCGC················GC··································CT···························    126    
    Chr03-3(15.4%)    ········A····-······T·A·········TTATAATTCCGCGC················TT··································TG···························    126    
    Chr03-4(11.5%)    ········G····C······G·G·········--------------················TT··································TG···························    113    
         Reference    ········A····-······T·A·········TTATAATTCCGCGC················GC··································CT···························    126    
                              
         Consensus    GGAAGGGGRGTCA+CCCTGCKGRGGAACGCGC++++++++++++++ACATGAAAGTGCACATKYTTGCGTTTGTTCATGTGTGCGTTTTGTTTATGTAYKAATGTTCTTGTCGTATGTCTGCTTATA    127    
                              
                              
                                        20                  40                  60                  80                 100        
                      '''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'        
      Chr05-1(12%)    -······GC······················································C···························TG····G···    100    
     Chr05-2(9.3%)    -······AT······················································A···························TT····G···    100    
     Chr05-3(9.3%)    ·······AT······················································A···························TT····G···    101    
       Chr05-4(8%)    -······GC······················································C···························CT····A···    100    
       Chr05-5(8%)    ·······GC······················································C···························CT····A···    101    
       Chr05-6(8%)    ·······GC······················································C···························TG····G···    101    
     Chr05-7(5.3%)    -······AT······················································C···························TG····A···    100    
     Chr05-8(5.3%)    -······GT······················································C···························CT····A···    100    
     Chr05-9(5.3%)    -······GT······················································C···························TT····G···    100    
    Chr05-10(5.3%)    ·······GT······················································C···························CT····A···    101    
    Chr05-11(5.3%)    ·······GT······················································C···························TT····G···    101    
    Chr05-12(2.7%)    -······AT······················································A···························CT····A···    100    
    Chr05-13(2.7%)    -······AT······················································C···························CT····A···    100    
    Chr05-14(2.7%)    -······GC······················································A···························TG····A···    100    
    Chr05-15(2.7%)    -······GC······················································C···························TG····A···    100    
    Chr05-16(2.7%)    ·······AT······················································A···························CT····A···    101    
    Chr05-17(2.7%)    ·······AT······················································C···························TG····A···    101    
    Chr05-18(2.7%)    ·······GC······················································C···························TG····A···    101    
         Reference    -······AT······················································A···························TT····G···    100    
                              
         Consensus    CCATATGRYATATCCTTTTTTTTTTCTTTGTCATGGGCAGGCTTTACCGCTATTTTGTGTATTMTTTTCGAGTATCTCTTCTTTCTTTTAAYKCCAARCAG    101    
                              
                              
                                        20                  40                  60                  80                 100                 120        
                      '''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|''''''        
    Chr06-1(27.7%)    ·····AA·············A·····C·······-······TG··G·····················A···················G···A-···G·G·····-·A···G··T······················    133    
    Chr06-2(12.8%)    ·····CA·············C·····C·······-······TT··T·····················A···················A···CT···T·A·····G·T···G··G······················    135    
     Chr06-3(8.5%)    ·····CA·············A·····T·······-······TG··G·····················A···················A···CT···T·A·····G·T···G··G······················    135    
     Chr06-4(8.5%)    ·····CA·············C·····C·······-······CG··T·····················A···················G···A-···G·A·····-·A···G··T······················    133    
     Chr06-5(8.5%)    ·····CA·············C·····C·······-······CG··T·····················A···················G···A-···G·G·····-·A···G··T······················    133    
     Chr06-6(8.5%)    ·····CA·············C·····C·······-······CG··T·····················G···················G···A-···G·A·····-·A···G··T······················    133    
     Chr06-7(8.5%)    ·····CA·············C·····C·······-······TT··T·····················A···················G···AT···T·A·····G·T···G··G······················    135    
     Chr06-8(4.3%)    ·····AA·············A·····C·······-······TG··G·····················G···················G···A-···G·A·····-·A···G··T······················    133    
     Chr06-9(4.3%)    ·····AC·············A·····C·······A······TT··T·····················A···················G···A-···G·G·····-·A···G··T······················    134    
    Chr06-10(4.3%)    ·····AC·············A·····C·······-······TG··G·····················A···················G···A-···G·G·····-·A···G··T······················    133    
    Chr06-11(4.3%)    ·····CA·············C·····C·······-······TT··T·····················A···················A···CT···T·A·····G·T···A··G······················    135    
         Reference    ·····AC·············A·····C·······-······TT··T·····················A···················G···A-···G·A·····-·A···G··T······················    133    
                              
         Consensus    ATACTMMTACGCTTAACTGCMCTCAAYAAGTTGC+TACTCCYKCGKTATGTACCTCTGCTAAATAGGRTGTACCTACAAATTGCTACRTTTM+TTTKCRCACCT+GWTTTRATKTACATTTTGCACGTTCCCATTT    136    
                              
                              
                                        20                  40                  60                  80                 100        
                      '''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''        
    Chr07-1(24.5%)    ·AC··C······TGC-·C··A···············A·G·······CACAC·AA····GG·GC··A·····A···A··A··AAG···A·A··AC·GC·G·····GCTA···TT·CT···    118    
    Chr07-2(20.8%)    ·AC··C······TGC-·C··A···············A·G·······CACAC·AA····GG·GC··A·····A···A··A··ATA···C·A··AC·AC·G·····GCTA···TT·CT···    118    
    Chr07-3(15.1%)    ·GA··G······AAC-·G··A···············A·G·······AAAGC·AG····GG·AC··A·····T···G··A··TTA···G·C··GA·AG·A·····AACG···TT·TT···    118    
     Chr07-4(9.4%)    ·AC··C······TGA-·G··G···············C·G·······AGCAG·AG····TA·GA··A·····A···G··G··ATA···C·A··AC·AC·G·····ACCC···CG·TG···    118    
     Chr07-5(7.5%)    ·GA··G······AAC-·G··A···············A·G·······AAAGC·AG····GG·GC··A·····A···A··A··ATA···C·A··AC·AC·G·····GCTA···TT·CT···    118    
     Chr07-6(7.5%)    ·GA··G······AAC-·G··A···············A·G·······AACAC·TG····GA·GC··C·····T···G··G··AAG···G·C··AA·AG·G·····ACCC···TG·AT···    118    
     Chr07-7(3.8%)    ·AC··C······TGC-·C··A···············A·G·······AGCAG·AG····TA·GA··A·····A···G··G··GTA···C·A··AC·AC·G·····ACCC···CG·TG···    118    
     Chr07-8(3.8%)    ·AC··C······TGC-·C··A···············C·G·······AGCAG·AG····TA·GA··A·····A···A··A··AAG···A·A··GC·GC·G·····ACCC···CG·TG···    118    
     Chr07-9(3.8%)    ·GA··G······AAC-·G··A···············A·A·······AACAC·TG····GA·GC··C·····T···G··G··AAG···G·C··AA·AG·G·····ACCC···TG·AT···    118    
    Chr07-10(1.9%)    ·AC··C······TGCA·C··A···············A·G·······CACAC·AA····GG·GC··A·····A···A··A··ATA···C·A··AC·AC·G·····GCTA···TT·CT···    119    
    Chr07-11(1.9%)    ·AC··T······TGCA·C··A···············A·G·······CACAC·AA····GG·GC··A·····A···A··A··ATA···C·A··AC·AC·G·····GCTA···TT·CT···    119    
         Reference    ·AC··C······TGC-·C··A···············A·G·······CACAC·AA····GG·GC··A·····A···A··A··AAG···A·A··AC·GC·G·····GCTA···TT·CT···    118    
                              
         Consensus    GRMTGBAGTGAAWRM+ASAGRTTAAGAAAGTCGAAGMTRATATTAAMRMRSAWRATGAKRARMTTMAAAAGWTAGRAARTGDWRCCAVTMAARMTRSTRAAAAGRMYVAATYKAHKGCC    119    
                              
                              
                                        20                  40                  60                  80                 100                 120                 140        
                      '''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'        
      Chr08-1(41%)    ····G···CCA··········T········A········-······················T···T········A·····A·-·························································    139    
    Chr08-2(35.9%)    ····G···CCA··········T········G········A······················T···T········A·····A·-·························································    140    
    Chr08-3(17.9%)    ····G···CCA··········T········A········-······················G···C········G·····T·G·························································    140    
     Chr08-4(5.1%)    ····A···---··········A········A········-······················G···C········G·····T·G·························································    137    
         Reference    ····G···CCA··········T········A········-······················T···T········A·····A·-·························································    139    
                              
         Consensus    CGGGRGGT+++TCCATCAGATWTGCTACTTRTAGAATGG+CCAAGACAAATGACTCATCTGTKTAGYAGTGGTTARAGTGTWC+AAGAGAGGCACTTATTGAGCGATTGGCAAATTTCCATATTAGCTACGAAGATGCTTA    141    
                              
                              


Haplotype sequence similarity

The figure represents the distinct haplotypes & SNP positions for each amplicon in dataset. The X-axis represents the SNP position and the Y-axis represents the haplotype name. The red color indicates that the haplotype contains SNPs at certain positions compared to the reference sequence.

hap_seq_sim(run$seq_tbl_end,run$marker_info)


Haplotype tree


This haplotype tree is created by integrating all major haplotype sequences (>50%) in each sample and perform multiple sequence alignment based on DECIPHER::AlignSeqs function to identify the similarity between samples (ape::nj).


No imputation

The figure represents the haplotype tree, the amplicon markers represent the border color, the haplotype names represent the fill color (Due to there are a lot of haplotype names, colors of different markers may repeat).

hap_tree_plot(run$seq_tbl_end)


Missing-data imputation

Since some samples are missing certain markers, the imputation are performed by replacing missing data within a variable by the mode (the most frequent value in each marker). The imputed values are colored in white.

hap_tree_plot_imputed(run$seq_tbl_end)

Principal component analysis(PCA)


Principal component analysis is performed by integrating all major haplotype sequences (>50%) in each sample and perform multiple sequence alignment based on DECIPHER::AlignSeqs function to identify and analysis sample genotypes.


No imputation

hap_pca(run$seq_tbl_end)


Missing-data imputation

Since some samples are missing certain markers, the imputation are performed by replacing missing data within a variable by the mode (the most frequent value in each marker).

hap_pca_imputed(run$seq_tbl_end)